{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas - Optimize Memory and Speed Operation\n", "\n", "- Watch [Other Interesting Data Science Topics](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA/videos)\n", "- Created By: **Aakash Goel**\n", "- Connect on [LinkedIn](https://www.linkedin.com/in/aakash-goel-587a7385/)\n", "- Subscribe on [YouTube](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA?sub_confirmation=1)\n", "- Created on: 17-FEB-2020\n", "- Last Updated on: 06-JUL-2020" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of contents\n", "\n", "1. See Hardware and System Information \n", " 1.1. System, CPU, Memory, Swap, Disk, GPU Information \n", "2. Reduce DataFrame size \n", " 2.1. Change in int datatype \n", " 2.2. Change in float datatype \n", " 2.3. Change from object to category datatype \n", " 2.4. Convert to Sparse DataFrame \n", "\n", "3. Reduce DataFrame Loading Time \n", " 3.1. Don't Load all columns \n", " 3.2. Read Large CSV File (Avoid Memory error and fast reading) \n", "\n", "4. Fast Pandas Operation \n", " 4.1. Loop and Vectorization \n", " 4.2. Indexing \n", " 4.3. Parallelization \n", " \n", "5. References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](datatypeMemory.PNG \"segment\")\n", "\n", " [1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](memory_nos.png \"segment\")\n", " [8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. See Hardware and System Information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is very important to know about your machine on which you are coding.\n", "Hardware and System Information helps you to take proper decision like how much `cores` you should allocate to your program.\n", "Code for same is taken from here." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 System, CPU, Memory, Swap, Disk, GPU Information" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================== Python Version ========================================\n", "3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]\n", "sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)\n", "======================================== System Information ========================================\n", "System: Windows\n", "Release: 10\n", "Version: 10.0.18362\n", "Machine: AMD64\n", "Processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel\n", "======================================== CPU Info ========================================\n", "Physical cores: 2\n", "Total cores: 4\n", "Max Frequency: 2000.00Mhz\n", "Min Frequency: 0.00Mhz\n", "Current Frequency: 2000.00Mhz\n", "CPU Usage Per Core:\n", "Core 0: 56.9%\n", "Core 1: 25.0%\n", "Core 2: 43.8%\n", "Core 3: 42.2%\n", "Total CPU Usage: 45.2%\n", "======================================== Memory Information ========================================\n", "Total: 7.91GB\n", "Available: 3.85GB\n", "Used: 4.06GB\n", "Percentage: 51.3%\n", "======================================== SWAP ========================================\n", "Total: 19.26GB\n", "Free: 13.41GB\n", "Used: 5.85GB\n", "Percentage: 30.4%\n", "======================================== Disk Information ========================================\n", "Partitions and Usage:\n", "=== Device: C:\\ ===\n", " Mountpoint: C:\\\n", " File system type: NTFS\n", " Total Size: 465.13GB\n", " Used: 65.29GB\n", " Free: 399.84GB\n", " Percentage: 14.0%\n", "Total read: 540.46GB\n", "Total write: 62.73GB\n", "======================================== GPU Details ========================================\n", "id name load free memory used memory total memory temperature uuid\n", "---- ------ ------ ------------- ------------- -------------- ------------- ------\n" ] } ], "source": [ "# ! pip install tabulate\n", "# ! 
pip install gputil\n", "import psutil\n", "import platform\n", "import sys\n", "from datetime import datetime\n", "import GPUtil\n", "from tabulate import tabulate\n", "\n", "def get_size(bytes, suffix=\"B\"):\n", " \"\"\"\n", " Scale bytes to its proper format\n", " e.g:\n", " 1253656 => '1.20MB'\n", " 1253656678 => '1.17GB'\n", " \"\"\"\n", " factor = 1024\n", " for unit in [\"\", \"K\", \"M\", \"G\", \"T\", \"P\"]:\n", " if bytes < factor:\n", " return f\"{bytes:.2f}{unit}{suffix}\"\n", " bytes /= factor\n", "\n", "#### CODE -- https://www.thepythoncode.com/article/get-hardware-system-information-python ####\n", "\n", "print(\"=\"*40, \"Python Version\", \"=\"*40)\n", "print (sys.version)\n", "print (sys.version_info)\n", "\n", "print(\"=\"*40, \"System Information\", \"=\"*40)\n", "uname = platform.uname()\n", "print(f\"System: {uname.system}\")\n", "print(f\"Release: {uname.release}\")\n", "print(f\"Version: {uname.version}\")\n", "print(f\"Machine: {uname.machine}\")\n", "print(f\"Processor: {uname.processor}\")\n", "\n", "# let's print CPU information\n", "print(\"=\"*40, \"CPU Info\", \"=\"*40)\n", "# number of cores\n", "print(\"Physical cores:\", psutil.cpu_count(logical=False))\n", "print(\"Total cores:\", psutil.cpu_count(logical=True))\n", "# CPU frequencies\n", "cpufreq = psutil.cpu_freq()\n", "print(f\"Max Frequency: {cpufreq.max:.2f}Mhz\")\n", "print(f\"Min Frequency: {cpufreq.min:.2f}Mhz\")\n", "print(f\"Current Frequency: {cpufreq.current:.2f}Mhz\")\n", "# CPU usage\n", "print(\"CPU Usage Per Core:\")\n", "for i, percentage in enumerate(psutil.cpu_percent(percpu=True, interval=1)):\n", " print(f\"Core {i}: {percentage}%\")\n", "print(f\"Total CPU Usage: {psutil.cpu_percent()}%\")\n", "\n", "# Memory Information\n", "print(\"=\"*40, \"Memory Information\", \"=\"*40)\n", "# get the memory details\n", "svmem = psutil.virtual_memory()\n", "print(f\"Total: {get_size(svmem.total)}\")\n", "print(f\"Available: {get_size(svmem.available)}\")\n", "print(f\"Used: {get_size(svmem.used)}\")\n", "print(f\"Percentage: {svmem.percent}%\")\n", "print(\"=\"*40, \"SWAP\", \"=\"*40)\n", "# get the swap memory details (if exists)\n", "swap = psutil.swap_memory()\n", "print(f\"Total: {get_size(swap.total)}\")\n", "print(f\"Free: {get_size(swap.free)}\")\n", "print(f\"Used: {get_size(swap.used)}\")\n", "print(f\"Percentage: {swap.percent}%\")\n", "\n", "# Disk Information\n", "print(\"=\"*40, \"Disk Information\", \"=\"*40)\n", "print(\"Partitions and Usage:\")\n", "# get all disk partitions\n", "partitions = psutil.disk_partitions()\n", "for partition in partitions:\n", " print(f\"=== Device: {partition.device} ===\")\n", " print(f\" Mountpoint: {partition.mountpoint}\")\n", " print(f\" File system type: {partition.fstype}\")\n", " try:\n", " partition_usage = psutil.disk_usage(partition.mountpoint)\n", " except PermissionError:\n", " # this can be catched due to the disk that\n", " # isn't ready\n", " continue\n", " print(f\" Total Size: {get_size(partition_usage.total)}\")\n", " print(f\" Used: {get_size(partition_usage.used)}\")\n", " print(f\" Free: {get_size(partition_usage.free)}\")\n", " print(f\" Percentage: {partition_usage.percent}%\")\n", "# get IO statistics since boot\n", "disk_io = psutil.disk_io_counters()\n", "print(f\"Total read: {get_size(disk_io.read_bytes)}\")\n", "print(f\"Total write: {get_size(disk_io.write_bytes)}\")\n", "\n", "# GPU information\n", "print(\"=\"*40, \"GPU Details\", \"=\"*40)\n", "gpus = GPUtil.getGPUs()\n", "list_gpus = []\n", "for gpu in gpus:\n", " # 
get the GPU id\n", " gpu_id = gpu.id\n", " # name of GPU\n", " gpu_name = gpu.name\n", " # get % percentage of GPU usage of that GPU\n", " gpu_load = f\"{gpu.load*100}%\"\n", " # get free memory in MB format\n", " gpu_free_memory = f\"{gpu.memoryFree}MB\"\n", " # get used memory\n", " gpu_used_memory = f\"{gpu.memoryUsed}MB\"\n", " # get total memory\n", " gpu_total_memory = f\"{gpu.memoryTotal}MB\"\n", " # get GPU temperature in Celsius\n", " gpu_temperature = f\"{gpu.temperature} °C\"\n", " gpu_uuid = gpu.uuid\n", " list_gpus.append((\n", " gpu_id, gpu_name, gpu_load, gpu_free_memory, gpu_used_memory,\n", " gpu_total_memory, gpu_temperature, gpu_uuid\n", " ))\n", "\n", "print(tabulate(list_gpus, headers=(\"id\", \"name\", \"load\", \"free memory\", \"used memory\", \"total memory\",\n", " \"temperature\", \"uuid\")))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import nbconvert\n", "from time import time\n", "import gc\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Reduce DataFrame size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Change in int datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have `Age` column having `minimum value 1 and maximum value 150`, with `10 million` total rows in dataframe \n", "**Task**: Reduce Memory Usage of `Age` column given above constraints \n", "**Action**: Change of original dtype from `int32` to `uint8` \n", "**Result**: Drop from `38.1 MB to 9.5 MB` in Memory usage i.e. `75%` reduction" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "## Initializing minimum and maximum value of age\n", "min_age_value , max_age_value = 1,150\n", "## Number of rows in dataframe\n", "nrows = int(np.power(10,7))\n", "## creation of Age dataframe\n", "df_age = pd.DataFrame({'Age':np.random.randint(low=1,high=100,size=nrows)})" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 1 columns):\n", "Age int32\n", "dtypes: int32(1)\n", "memory usage: 38.1 MB\n" ] } ], "source": [ "## check memory usage before action\n", "df_age.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "iinfo(min=0, max=255, dtype=uint8)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Range of \"uint8\"; satisfies range constraint of Age column \n", "np.iinfo('uint8')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"int32\" to \"uint8\"\n", "converted_df_age = df_age.astype(np.uint8)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 1 columns):\n", "Age uint8\n", "dtypes: uint8(1)\n", "memory usage: 9.5 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_age.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Change in float datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let 
's say you have `50,000 search queries` and `5,000 documents`, and you have computed the `cosine similarity` of each search query with every document, i.e. a matrix of `dimension 50,000 X 5,000`. All similarity values are between `0 and 1` and need at least `2 decimal precision` \n", "**Task**: Reduce Memory Usage of the cosine similarity dataframe given the above constraints \n", "**Action**: Change of original dtype from `float64` to `float16` \n", "**Result**: Drop from `1.9 GB to 476.8 MB or 0.46 GB` in Memory usage i.e. `75%` reduction " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No. of search queries: 50000 and No. of documents: 5000\n" ] } ], "source": [ "## no. of documents\n", "ncols = int(5*np.power(10,3))\n", "## no. of search queries\n", "nrows = int(5*np.power(10,4))\n", "## creation of cosine similarity dataframe\n", "df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))\n", "print(\"No. of search queries: {} and No. of documents: {}\".format(df_query_doc.shape[0],df_query_doc.shape[1]))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Columns: 5000 entries, 0 to 4999\n", "dtypes: float64(5000)\n", "memory usage: 1.9 GB\n" ] } ], "source": [ "## check memory usage before action\n", "df_query_doc.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"float64\" to \"float16\"\n", "converted_df_query_doc = df_query_doc.astype('float16')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Columns: 5000 entries, 0 to 4999\n", "dtypes: float16(5000)\n", "memory usage: 476.8 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_query_doc.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Change from object to category datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let's say you have a `Day of Week` column with `7 unique` values and `49 million` total rows in the dataframe \n", "**Task**: Reduce Memory Usage of the `Day of Week` column given that only 7 unique values exist \n", "**Action**: Change of dtype from `object` to `category` as the ratio of unique values to number of rows is almost zero \n", "**Result**: Drop from `2.9 GB to 46.7 MB or 0.045 GB` in Memory usage i.e. 
`98%` reduction " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No of rows in days of week dataframe 49000000\n" ] } ], "source": [ "## unique values of \"days of week\"\n", "day_of_week = [\"monday\",\"tuesday\",\"wednesday\",\"thursday\",\"friday\",\"saturday\",\"sunday\"]\n", "## Number of times day_of_week repeats\n", "repeat_times = 7*np.power(10,6)\n", "## creation of days of week dataframe\n", "df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})\n", "print(\"No of rows in days of week dataframe {}\".format(df_day_of_week.shape[0]))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 49000000 entries, 0 to 48999999\n", "Data columns (total 1 columns):\n", "day_of_week object\n", "dtypes: object(1)\n", "memory usage: 2.9 GB\n" ] } ], "source": [ "## check memory usage before action\n", "df_day_of_week.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"object\" to \"category\"\n", "converted_df_day_of_week = df_day_of_week.astype('category')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 49000000 entries, 0 to 48999999\n", "Data columns (total 1 columns):\n", "day_of_week category\n", "dtypes: category(1)\n", "memory usage: 46.7 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_day_of_week.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
day_of_week
0monday
1monday
\n", "
" ], "text/plain": [ " day_of_week\n", "0 monday\n", "1 monday" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check first two rows of dataframe\n", "converted_df_day_of_week.head(2)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "dtype: int8" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check how mapping of day_of_week is created in category dtype\n", "converted_df_day_of_week.head(2)['day_of_week'].cat.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Convert to Sparse DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have dataframe having `large count of zero or missing values (66%)` usually happens in lot of `NLP task` like Count/TF-IDF encoding, Recommender Systems [2] \n", "**Task**: Reduce Memory Usage of dataframe \n", "**Action**: Change of DataFrame type to `SparseDataFrame` as Percentage of Non-Zero Non-NaN values is very less in number \n", "**Result**: Drop from `228.9 MB to 152.6 MB` in Memory usage i.e. `33%` reduction " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "## number of rows in dataframe\n", "nrows = np.power(10,7)\n", "## creation of dataframe\n", "df_dense =pd.DataFrame([[0,0.23,np.nan]]*nrows)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 3 columns):\n", "0 int64\n", "1 float64\n", "2 float64\n", "dtypes: float64(2), int64(1)\n", "memory usage: 228.9 MB\n" ] } ], "source": [ "## check memory usage before action\n", "df_dense.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Percentage of Non-Zero Non-NaN values in dataframe 33.33 %\n" ] } ], "source": [ "## Percentage of Non-zero and Non-NaN values in dataframe\n", "non_zero_non_nan = np.count_nonzero((df_dense)) - df_dense.isnull().sum().sum()\n", "non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100,2)\n", "print(\"Percentage of Non-Zero Non-NaN values in dataframe {} %\".format(non_zero_non_nan_percentage))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "## Action: Change of DataFrame type to SparseDataFrame\n", "df_sparse = df_dense.to_sparse()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 3 columns):\n", "0 Sparse[int64, nan]\n", "1 Sparse[float64, nan]\n", "2 Sparse[float64, nan]\n", "dtypes: Sparse[float64, nan](2), Sparse[int64, nan](1)\n", "memory usage: 152.6 MB\n" ] } ], "source": [ "## check memory usage after action\n", "df_sparse.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Reduce DataFrame Loading Time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Don't Load all columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let's say you have a csv file with `100 columns` but you only need `10` of them for processing \n", "**Task**: Reduce `pd.read_csv()` Loading time \n", "**Action**: Load only the required columns explicitly, i.e. 10 columns instead of the default of all 100 \n", "**Result**: Drop from `2.8 seconds to 0.8 seconds` in Loading time i.e. `71%` reduction " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default Loading time 2.868640422821045 seconds\n" ] } ], "source": [ "t0 = time()\n", "ufo = pd.read_csv('http://bit.ly/uforeports')\n", "t1 = time()\n", "print(\"Default Loading time {} seconds\".format(t1-t0))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading time with usecols 0.8735167980194092 seconds\n" ] } ], "source": [ "t0 = time()\n", "ufo = pd.read_csv('http://bit.ly/uforeports',usecols=['City','State'])\n", "t1 = time()\n", "print(\"Loading time with usecols {} seconds\".format(t1-t0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Read Large CSV File" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let's say you have a large csv file that takes too much time to read, or gives a memory error, when read in one go \n", "**Task**: Avoid Memory Error or reduce file loading time \n", "**Action 1**: Instead of loading the file in one go, load it in chunks, i.e. use the `chunksize` parameter of `read_csv`. Useful when you don't need all the data in memory at once (see the sketch after this list) \n", "**Action 2**: See if you can save memory with the following operations:\n", "- Explicitly pass appropriate `dtypes` for each column while loading the file to reduce the volume of the data. Example - sometimes `64-bit` precision is overkill and `32-bit` is enough; the 2x memory saving is worth it.\n", "- Use the `usecols` parameter of `read_csv`, i.e. only load the required columns\n", "- `Data compression` - Here, I don't mean zip or gzip files. Say your status column has two values: \"AVAILABLE\" and \"UNAVAILABLE\". Instead of storing them as strings (~10 bytes per entry), store them as booleans: True or False (1 byte per entry)\n", "\n", "**Action 3**: For faster data loading, serialize data (pickle) on the first run and load the serialized data instead of the csv files in subsequent runs. [15]\n", "\n", "**Action 4**: Use another library with similar functionality instead of pandas:\n", "- Dask - Dask provides multi-core execution on larger-than-memory datasets.\n", "\n", "\n", "- Modin - Modin transparently distributes the data and computation, providing speed-ups of up to 4x on a laptop with 4 physical cores. Pandas uses only one core at a time for most computations; with Modin, you can use all of the CPU cores on your machine. 
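\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of **Action 1** (added for illustration, not part of the original benchmarks): reading a CSV in chunks with the `chunksize` parameter and combining only the small per-chunk results. It reuses the ufo-reports CSV from section 3.1; the chunk size and the `value_counts` aggregation are arbitrary choices." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Sketch: process a CSV in chunks instead of in one go\n", "## read_csv(chunksize=...) returns an iterator of DataFrames; nothing is read until you loop over it\n", "chunk_iter = pd.read_csv('http://bit.ly/uforeports', usecols=['City','State'], chunksize=5000)\n", "## aggregate each chunk separately, then combine the (small) partial results\n", "state_counts = pd.concat([chunk['State'].value_counts() for chunk in chunk_iter])\n", "state_counts = state_counts.groupby(level=0).sum()\n", "state_counts.head()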
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](read_csv_benchmark.png \"segment\")\n", " [9]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](lib_comparison.png \"segment\")\n", " [11]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# # pip install modin[ray]\n", "# import ray\n", "# ray.init(num_cpus=4)\n", "# import modin.pandas as pd\n", "# export MODIN_OUT_OF_CORE=true" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](modin_vs_pandas.png \"segment\")\n", " [13]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Fast Pandas Operation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Loop and Vectorization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](pandas_operation.png \"segment\")\n", " [16]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](pandas_optimization.png \"segment\")\n", " [17]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you want to join two dataframes based on some common column \n", "**Task**: Reduce `pd.merge()` time \n", "**Action**: During `merge` operation, use indexing \n", "**Result**: Drop from `10 second to 2 second` in Loading time i.e. `80%` reduction \n", "\n", "\n", "**General Advice**: Use indexing as much as possible for merging and value lookup in dataframe [18] " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time Taken in merge before indexing 9.495047569274902\n", "Time Taken in merge after indexing 2.741480827331543\n" ] } ], "source": [ "df1 = pd.DataFrame({'A':list(range(10000000)),'B':list(range(0,20000000,2))})\n", "df2 = pd.DataFrame({'A':list(range(10000000)),'C':list(range(0,40000000,4))})\n", "\n", "t0 = time()\n", "df = pd.merge(df1,df2,how='left',on='A')\n", "t1 = time()\n", "print(\"Time Taken in merge before indexing {}\".format(t1-t0))\n", "\n", "gc.collect()\n", "\n", "t0 = time()\n", "df1_ = df1.set_index('A')\n", "df2_ = df2.set_index('A')\n", "df_ = pd.merge(df1_,df2_, left_index=True, right_index=True)\n", "t1 = time()\n", "print(\"Time Taken in merge after indexing {}\".format(t1-t0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Parallelization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**: Can we parallelize `pandas apply or groupby` function without much change in code ?\n", "\n", "\n", "**Answer**: Yes, there are lot of libraries which can help in parallelizing pandas apply function without much change in code.\n", "\n", "**Libraries**\n", "- pandarallel\n", "\n", "![SegmentLocal](standard_vs_parallel_4_cores.png \"segment\")\n", " [19]\n", "\n", "**General Advice:** Parallelization is efficient only if the amount of calculation to parallelize is high enough. For very little amount of data, using parallelization is not always worth it.\n", "\n", "- Swifter - It applies any function to a pandas dataframe or series in the fastest available manner (Vectorize OR Dask Parallel Processing OR Pandas apply). This notebook contains speed comparison for swifter, Please check." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "#########################################\n", "#### PANDARALLEL CODE #################\n", "#########################################\n", "# from pandarallel import pandarallel\n", "# pandarallel.initialize(nb_workers = _NO_CORES_USED_FOR_PARALLELIZATION,progress_bar = True)\n", "# df.parallel_apply(func)\n", "# df.groupby(args).parallel_apply(func)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "#########################################\n", "#### SWIFTER CODE #################\n", "#########################################\n", "# import pandas as pd\n", "# import swifter\n", "# df['Col'].swifter.apply(anyfunction)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. References\n", "\n", "1) [https://www.dataquest.io/blog/pandas-big-data/](https://www.dataquest.io/blog/pandas-big-data/) \n", "2) [https://machinelearningmastery.com/sparse-matrices-for-machine-learning/](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/) \n", "3) [https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe](https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe) \n", "4) [https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) \n", "5) [https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/](https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/) \n", "6) [https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) \n", "7) [https://pythonspeed.com/articles/pandas-load-less-data/](https://pythonspeed.com/articles/pandas-load-less-data/) \n", "8) [https://numpy.org/devdocs/user/basics.types.html](https://numpy.org/devdocs/user/basics.types.html) \n", "9) [https://modin.readthedocs.io/en/latest/](https://modin.readthedocs.io/en/latest/) \n", "10) **DASK** - [https://docs.dask.org/en/latest/best-practices.html](https://docs.dask.org/en/latest/best-practices.html)\n", "\n", "11) [https://towardsdatascience.com/scaling-pandas-comparing-dask-ray-modin-vaex-and-rapids-c74c85a4e59c](https://towardsdatascience.com/scaling-pandas-comparing-dask-ray-modin-vaex-and-rapids-c74c85a4e59c) \n", "12) [https://pythonspeed.com/datascience/](https://pythonspeed.com/datascience/) \n", "13) **Modin Vs Pandas** [https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html](https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html) \n", "14) [https://www.csvexplorer.com/blog/open-big-csv/](https://www.csvexplorer.com/blog/open-big-csv/) \n", "15) [https://www.kaggle.com/c/zillow-prize-1/discussion/37261](https://www.kaggle.com/c/zillow-prize-1/discussion/37261) \n", "16) [https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6) \n", "17) [https://github.com/s-heisler/pycon2017-optimizing-pandas/blob/master/MinneAnalytics%20materials/Minneanalytics_talk_slides.pdf](https://github.com/s-heisler/pycon2017-optimizing-pandas/blob/master/MinneAnalytics%20materials/Minneanalytics_talk_slides.pdf) \n", "18) 
[https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2](https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2) \n", "19) [https://github.com/nalepae/pandarallel](https://github.com/nalepae/pandarallel) \n", "20) [https://www.thepythoncode.com/article/get-hardware-system-information-python](https://www.thepythoncode.com/article/get-hardware-system-information-python) \n", "\n", "\n", "**NOTE:** I acknowledge the work in the above references; this notebook uses some of the code mentioned in them." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }